Hashing ======= Hash Table ---------- BSTs provide add, contains, and remove operations in O(log n) time. Can we create a structure that improves on O(log n) time? Yes Hashing can implement a Set in O(1) per operation Suppose we need to create a set of employee ID numbers, where each ID number is between 1 and 1000. How can we store the numbers so we can add/contains/remove in O(1) time? Make an array of size 1000 and index it with the item value add(i) array[i]++ contains(i) return array[i] > 0 remove(i) array[i]-- Could we use this approach to store items (like Strings) that are not numbers? Yes: convert the strings into numbers Characters are represented by codes (commonly from 0 to 127) Treat a string as a base-128 number Could we use this approach to store larger numbers? Suppose we are storing only 1000 different numbers, but the numbers range in value from 1 to 1000000. Would we make an array of size 1000000? No: make an array closer to size 1000 Use a hash function to map large numbers into smaller ones What is a Hash Table? An array used to store items in a collection Items are stored in the array at the index given by a special function known as a hash function What is a Hash Function? A function that maps an item to a number The number is in the range of a valid array index The function usually contains a mod by table size (x % tableSize) What is a Collision? When two or more items hash to the same number Resolve with chaining, linear probing, quadratic probing Hash Functions -------------- What are the characteristics of a good Hash Function? Gives a number between 0 and tableSize-1 Easy/fast to compute Distributes items evenly throughout the hash table What is a typical hash function for integers? value % tableSize What is a typical hash function for Strings? A hash function for Strings must first convert a String to a number Consider the following function for converting a String to a number public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal += key.charAt(i); return hashVal % size; } What is wrong with this code? Same characters in different orders (e.g., tab, bat) produce the same result Gives only small numbers (10 chars give at most 1280) Consider the following function for converting a String to a number public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = (hashVal * 128 + key.charAt(i)) % size; return hashVal; } How is this hash function better than the previous one? Characters are weighted by their positions -> Different orders give different numbers and larger numbers can be produced Consider the following function for converting a String to a number public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = hashVal * 128 + key.charAt(i); hashVal %= size; if (hashVal < 0) hashVal += size; return hashVal; } How is this hash function better than the previous one? The expensive mod operation is not repeated on every loop What problem does this code have with long strings? With long strings the early characters are lost from the answer Consider the following function for converting a String to a number public static int hash(String key, int size) { int hashVal = 0; for (int i = 0; i < key.length(); i++) hashVal = hashVal * 37 + key.charAt(i); hashVal %= size; if (hashVal < 0) hashVal += size; return hashVal; } How is this hash function better than the previous one? The early characters are not lost as quickly What is a typical hash function for Java Objects? The hashCode method maps the object to an integer item.hashCode() % tableSize How do we implement hashCode for classes you write? Combine the hashcodes for the parts of the object that make up the key public class Vehicle { int state; // key String id; // key String make; String model; int year; public boolean equals(Object obj) { if (obj instanceof Vehicle) { Vehicle other = (Vehicle)obj; return state == other.state && id.equals(other.id); } else { return false; } } public int hashCode() { return state + id.hashCode(); } } Collision Resolution -------------------- Linear Probing How do you insert an item in a hash table using Linear Probing? Sequentially scan the array until an empty cell is found Wrap-around from the last position to the first Exercise Show the result of inserting 2, 7, 3, 8 into a hash table of size 5 with linear probing Show the result of inserting 7, 6, 1, 2 into a hash table of size 5 with linear probing How do we find an item in a hash table using Linear Probing? Follow the same sequence as insert If we reach an empty slot, the item is not in the table Exercise Show the result of finding 8 and 9 in the first hash table above How do we delete an item from a hash table using Linear Probing? We must use lazy deletion Mark items deleted Analysis of Linear Probing Load factor lambda = fraction of the table that is full Given a hash table with load factor lambda, what is the probability of hitting an empty cell with one probe? 1 - lambda Assuming that probes are independent (not valid), how many probes (on average) are needed to find an empty slot in the table? 1 / (1 - lambda) 2 probes in a half-full table 10 probes in a 90%-full table Why are the probes not independent? What is Primary Clustering? Large blocks of occupied cells are formed Insertions at indexes in the cluster are expensive Assuming that probes are not independent, how many probes (on average) are needed to find an empty slot in the table? (1 + 1 / (1 - lambda)^2) / 2 2.5 probes in a half-full table 50 probes in a 90%-full table Does Linear Probing work well enough to use in practice? Yes If we keep the load factor low (lambda < 0.5), then linear probing works well If we could remove primary clustering, we'd only save 0.5 probe on insert, 0.1 probe on find What is a good Load Factor to use with Linear Probing? Keep the load factor low, lambda < 0.5 Quadratic Probing How do you insert an item in a hash table using Linear Probing? Scan cells 1, 4, 9 away from the original probe If the hash function gives H, scan H+(i^2) instead of H+i (i = 1, 2, ...) Wrap-around from the last position to the first Eliminates primary clustering Exercise Show the result of inserting 2, 7, 3, 8 into a hash table of size 5 with quadratic probing Show the result of inserting 7, 6, 1, 2 into a hash table of size 5 with quadratic probing Does quadratic probing try all cells? If the table size is prime and the load factor is <= 0.5, then probes will be to different locations Implementation issues Linear Probing is fast and easy (increment, wraparound) Quadratic Probing appears expensive (increment, square, add, wraparound) Can we implement quadratic probing efficiently? increment i (i+1) shift left 1 bit (2(i+1)) subtract 1 (2(i+1) - 1) add to old position test for wrap-around How does adding 2(i+1)-1 to the old position give the same result as H + (i+1)^2? (H + (i+1)^2) - (H + (i)^2) = 2i + 1 = 2(i+1) - 1 What should we do when the load factor exceeds 0.5? Rehashing Make a new table twice as big (but use a prime number) Find the next prime number at least twice as large Do we copy each item from the old table to the same location in the new table? No, the hash function is different because the table size is different Hash each item into the new table How well does Quadratic Probing perform? Eliminates primary clustering Suffers from secondary clustering Elements that hash to the same position probe the same alternate cells Causes less than an extra half-probe per search Can we eliminate Secondary Clustering? Yes, use Double Hashing Use a second hash function for collision resolution H(x) + 0*H2(x) H(x) + 1*H2(x) H(x) + 2*H2(x) H(x) + i*H2(x) Chaining -------- How do we resolve collisions using Chaining? The hash table is an array of linked-lists The hash function tells to which list an item is added and in which list an item can be found Exercise Show the result of inserting 2, 7, 3, 8 into a hash table of size 5 (use the "standard" hash function for integers) What is the Big-Oh bound of add/contains/remove when chaining is used? Worst-case is O(n) (all items hash to the same bucket) Average-case is O(n/m) (items distributed evenly over the m buckets) What is the Load Factor of a hash table? The fraction of the table that is full lambda = n/m What is a good Load Factor to use with chaining? 1.0 Lower does not significantly improve performance Higher can save space Which Hashing or AVL trees is better? Trees support order Hashing is O(1), trees are O(logN) Hashing is simpler Implementing Hashing (Project #7) --------------------------------- You will implement the Set interface (add/remove/contains) using hashing What hash function will you use? Math.abs(item.hashCode % hash_table_size) How will you resolve collisions? Chaining The hash table will be an array of what type? Array of List How will you implement size()? Use an integer variable Increment when items are added Decrement when items are removed How will you implement add()? Use the hash function to get the index Call add on the List at the index How will you implement contains()? Use the hash function to get the index Call contains on the List at the index How will you implement remove()? Use the hash function to get the index Call remove on the List at the index What will be the size the hash table? Initially 101 When will you rehash to a larger table? When the load factor exceeds 1 What size table will you create when you rehash? 2 * hash_table_size + 1 How will you rehash into the new table? Iterate through the old table and add to the new table You will also implement an Iterator (hasNext/next) that iterates over the hash table What data is needed in an Iterator for the hash table? The current index in the array The current position in the List at that index What order must the Iterator use when returning items? Sequentially from index 0 to hash_table_size-1 For each index, give the items in the order they were added How will you implement next()? Move to the next item on the list If at the end of the list, move to the next index How will you implement hasNext()? How will you implement toArray()? Make a new array of the exact size Iterate over the table putting items in the array You will then implement the Map interface (put/get/remove) using the hash set What data is needed? A hash table object What's stored in a Map? Key/value pairs How do you store key/value pairs in a Set? Make a Pair class that contains a key and a value Define the equals and hashCode methods using only the key What's stored in a Pair object? A key and a value How do you implement equals()? Return the result of equals on the keys How do you implement hashCode()? Return the result of hashCode on the key How do you implement clear(), isEmpty(), size()? hash.clear() hash.isEmpty() hash.size() How do you implement get()? Add a find method to the hash table implementation Find returns the object that matches Make a pair with the key and a null value Call find with the pair How do you implement put()? If a pair with the same key is in the table, remove it Save the old value to be returned Make a new pair with the key and value Add the new pair to the hash table How do you implement remove()? Save the old value to be returned Make a pair with the key and a null value Remove the pair from the hash table